BRFSS Exploratory Data Analysis (EDA)¶

This notebook focuses on preparing and analyzing self-reported mental health variables from the 2022 BRFSS dataset. It includes variable cleaning, encoding of ordinal responses, handling of missing values, and construction of a composite Depression Index (DI). The notebook also compares different imputation strategies and visualizes their geographic effects across U.S. states.

Table of Contents¶

  1. Introduction
  2. How This Dataframe Was Created
  3. Why This Analysis Matters
  4. Notebook Structure
  5. Data Loading & Initial Exploration
  6. Computing the Depression Index
  7. Filling Missing DI Values: Neighboring State & National Mean Approaches
  8. Visualizing the Depression Index
  9. Findings & Next Steps

Exploring the BRFSS Data and Depression Index¶

Introduction¶

This notebook presents an independent exploratory analysis of the Behavioral Risk Factor Surveillance System (BRFSS) 2022 dataset before integrating it with additional factors. The primary objective is to analyze self-reported perceptions of depression across U.S. states and establish a structured dataset for future research.

How This Dataframe Was Created¶

The BRFSS dataset used in this analysis originates from Vikram's notebook (BRFSS&DAYLIGHT.ipynb). To ensure a structured and modular approach:

  • Extracted the BRFSS dataset from Vikram’s notebook.
  • Saved it as a CSV file for reproducibility and easier manipulation.
  • Uploaded the dataset into this working environment for independent analysis and visualization.

This notebook documents the cleaning, transformation, and visualization steps to maintain transparency in the analytical process. The findings will later contribute to our milestone project’s final analysis.

Why This Analysis Matters¶

Improving Mental Health Awareness¶

Understanding self-reported depression levels across states provides a structured baseline for further research and potential interventions.

Structured Data for Future Analysis¶

This notebook ensures proper cleaning and imputation before integrating additional factors, allowing for reproducible and scalable research.

Ensuring Data Integrity¶

Proper handling of missing values and categorical variables helps avoid biases in the final Depression Index (DI), ensuring more reliable insights.

Laying the Groundwork for Expanded Research¶

The structured Depression Index can be used to explore external influences and long-term trends in future analyses.

Notebook Structure¶

This notebook is structured as follows:

  1. Data Loading & Initial Exploration – Importing the dataset, checking structure, and handling missing values.
  2. Computing the Depression Index (DI) – Defining and calculating the Depression Index based on self-reported survey responses.
  3. Handling Missing DI Values – Addressing missing data using different imputation strategies.
  4. Visualizing the Depression Index – Generating choropleth maps and other plots to examine regional trends.
  5. Findings & Next Steps – Summarizing key takeaways and outlining future integration with additional factors.

Data Loading & Initial Exploration¶

This section focuses on:

  • Importing the Behavioral Risk Factor Surveillance System (BRFSS) dataset.
  • Mapping state abbreviations to full names for better readability.
  • Verifying dataset structure, checking for missing values, and inspecting the first few rows.

Understanding the dataset’s structure ensures accurate transformations in later stages.

Importing the Dataset¶

The dataset is loaded into a pandas DataFrame from a CSV file.

  • The file path is defined explicitly so the data source is unambiguous.
  • Reading from a saved CSV keeps the analysis reproducible across environments.
In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import MinMaxScaler
import plotly.express as px
import us  # For state name conversion
import matplotlib.pyplot as plt
import warnings  # Suppress warnings

# Suppress warnings
warnings.filterwarnings("ignore")

# Load the dataset
PATH = "data/"  # Define file path for clarity
file = "brfss_2022.csv"
scoringdf = pd.read_csv(PATH + file)

scoringdf = scoringdf[scoringdf["IYEAR"] == 2022].copy()

Mapping State Codes to Two-Letter Abbreviations¶

Objective¶

To ensure consistency across visualizations and analysis, this step converts the numeric FIPS state codes in `_STATE` into two-letter state abbreviations (e.g., CA, TX, NY).

Methodology¶

  1. Select Relevant Columns: Extract necessary columns for scoring, ensuring that state codes are included for mapping.
  2. Convert State Codes: Map FIPS state codes to their respective two-letter abbreviations using a predefined dictionary.
  3. Filter Valid States: Remove any states or territories that do not align with the analysis scope.

Implementation¶

In [2]:
scoringcols = ['_STATE','MENTHLTH', 'POORHLTH', 'ADDEPEV3', 'LSATISFY', 'EMTSUPRT', 'SDHISOLT', 'SDHSTRE1']
scoringdf = scoringdf[scoringcols]


state_mapping = {
    1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR', 6: 'CA',
    8: 'CO', 9: 'CT', 10: 'DE', 11: 'DC', 12: 'FL', 13: 'GA',
    15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA', 20: 'KS', 21: 'KY', 22: 'LA',
    23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 28: 'MS', 29: 'MO', 30: 'MT',
    31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC', 38: 'ND',
    39: 'OH', 40: 'OK', 41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN',
    48: 'TX', 49: 'UT', 50: 'VT', 51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY',
    66: 'GU', 72: 'PR', 78: 'VI'  # Territories: Guam, Puerto Rico, Virgin Islands
}


scoringdf['_STATE'] = scoringdf['_STATE'].map(state_mapping)


valid_states = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
    'PR'  # Puerto Rico
}

scoringdf = scoringdf[scoringdf['_STATE'].isin(valid_states)]
scoringdf.head()
Out[2]:
_STATE MENTHLTH POORHLTH ADDEPEV3 LSATISFY EMTSUPRT SDHISOLT SDHSTRE1
0 AL 88.0 NaN 2.0 1.0 1.0 5.0 4.0
1 AL 88.0 NaN 2.0 1.0 1.0 5.0 5.0
2 AL 3.0 2.0 2.0 2.0 2.0 3.0 5.0
3 AL 88.0 NaN 2.0 1.0 1.0 3.0 5.0
4 AL 88.0 88.0 2.0 1.0 1.0 5.0 5.0

Verifying Data Integrity¶

This block confirms:

  • Successful state name mapping after conversion.
  • Dataset structure to check data types and completeness.
  • A preview of the first few rows for initial inspection.
In [3]:
# Display unique states present after mapping
unique_states = scoringdf['_STATE'].unique()
print(f"Updated state values in data ({len(unique_states)} jurisdictions):\n{unique_states}")
Updated state values in data (51 jurisdictions):
['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN'
 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV'
 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN'
 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY' 'PR']
In [4]:
# Display dataset structure
print("\nDataset Information:")
scoringdf.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
Index: 412700 entries, 0 to 443600
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   _STATE    412700 non-null  object 
 1   MENTHLTH  412697 non-null  float64
 2   POORHLTH  236879 non-null  float64
 3   ADDEPEV3  412693 non-null  float64
 4   LSATISFY  235844 non-null  float64
 5   EMTSUPRT  235527 non-null  float64
 6   SDHISOLT  235228 non-null  float64
 7   SDHSTRE1  232916 non-null  float64
dtypes: float64(7), object(1)
memory usage: 28.3+ MB
In [5]:
# Show the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(scoringdf.head())
First few rows of the dataset:
  _STATE  MENTHLTH  POORHLTH  ADDEPEV3  LSATISFY  EMTSUPRT  SDHISOLT  SDHSTRE1
0     AL      88.0       NaN       2.0       1.0       1.0       5.0       4.0
1     AL      88.0       NaN       2.0       1.0       1.0       5.0       5.0
2     AL       3.0       2.0       2.0       2.0       2.0       3.0       5.0
3     AL      88.0       NaN       2.0       1.0       1.0       3.0       5.0
4     AL      88.0      88.0       2.0       1.0       1.0       5.0       5.0

Checking for Missing Values¶

This step:

  • Identifies columns with missing values to guide data cleaning.
  • Displays only relevant columns where missing values exist.
In [6]:
# Count missing values per column and display only non-zero entries
missing_values = scoringdf.isna().sum()
print("\nMissing Values Before Processing:")
print(missing_values[missing_values > 0])
Missing Values Before Processing:
MENTHLTH         3
POORHLTH    175821
ADDEPEV3         7
LSATISFY    176856
EMTSUPRT    177173
SDHISOLT    177472
SDHSTRE1    179784
dtype: int64

Generating Summary Statistics¶

This provides:

  • Basic statistical insights into numeric variables.
  • A quick check for inconsistencies in value ranges.
In [7]:
# Display summary statistics for numeric variables
print("\nSummary Statistics for Numeric Variables:")
print(scoringdf.describe())
Summary Statistics for Numeric Variables:
            MENTHLTH       POORHLTH       ADDEPEV3       LSATISFY  \
count  412697.000000  236879.000000  412693.000000  235844.000000   
mean       58.385392      51.877190       1.827782       1.684007   
std        37.842250      38.847974       0.609070       0.914879   
min         1.000000       1.000000       1.000000       1.000000   
25%        14.000000       7.000000       2.000000       1.000000   
50%        88.000000      88.000000       2.000000       2.000000   
75%        88.000000      88.000000       2.000000       2.000000   
max        99.000000      99.000000       9.000000       9.000000   

            EMTSUPRT       SDHISOLT       SDHSTRE1  
count  235527.000000  235228.000000  232916.000000  
mean        1.953886       4.048944       3.898616  
std         1.276959       1.128562       1.192885  
min         1.000000       1.000000       1.000000  
25%         1.000000       3.000000       3.000000  
50%         2.000000       4.000000       4.000000  
75%         2.000000       5.000000       5.000000  
max         9.000000       9.000000       9.000000  

Handling Invalid Responses¶

Certain survey responses use special codes (per the BRFSS codebook: 88 = None, 77 = Don't know/Not sure, 99 = Refused).
These codes carry no measurable quantity, so we recode them before analysis.

Imputation Strategy:¶

  • 88 is mapped to 0 for MENTHLTH and POORHLTH (indicating "No days of poor health").
  • 77 and 99 are replaced with NaN (missing values) as they represent invalid or unknown responses.
  • Responses outside the valid range are also replaced with NaN.

Valid Range Adjustments:¶

  • Previously, the valid range for MENTHLTH and POORHLTH was 1-30.
  • Now, the range is updated to 0-30, since 88 (previously "None") is now treated as 0 days.

Handling these codes explicitly keeps the day-count variables on a meaningful 0-30 scale and limits the bias that sentinel values would otherwise introduce.

In [8]:
# Define invalid response values from the BRFSS codebook
invalid_values = {
    "MENTHLTH": [77, 99],  # Keep 77 and 99 as invalid, replace 88 separately
    "POORHLTH": [77, 99],
    "ADDEPEV3": [7, 9],
    "LSATISFY": [7, 9],
    "EMTSUPRT": [7, 9],
    "SDHISOLT": [7, 9],
    "SDHSTRE1": [7, 9]
}

# Replace invalid values with NaN for proper handling
for column, values in invalid_values.items():
    scoringdf[column] = scoringdf[column].replace(values, np.nan)

# Replace 88 with 0 in MENTHLTH and POORHLTH
scoringdf["MENTHLTH"] = scoringdf["MENTHLTH"].replace(88, 0)
scoringdf["POORHLTH"] = scoringdf["POORHLTH"].replace(88, 0)

# Define valid ranges for numeric variables (now 0-30)
valid_ranges = {
    "MENTHLTH": range(0, 31),
    "POORHLTH": range(0, 31)
}

# Keep only values within valid ranges
for column, valid_range in valid_ranges.items():
    scoringdf[column] = scoringdf[column].apply(lambda x: x if x in valid_range else np.nan)
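As a quick sanity check, the cleaned day-count columns should now contain no sentinel codes and only values in 0-30. The snippet below illustrates the check on a small toy frame (hypothetical values standing in for `scoringdf`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for scoringdf after the recoding above (invented values)
df = pd.DataFrame({"MENTHLTH": [0.0, 3.0, 30.0, np.nan],
                   "POORHLTH": [0.0, 2.0, np.nan, 15.0]})

for col in ["MENTHLTH", "POORHLTH"]:
    valid = df[col].dropna()
    # No sentinel codes (77, 88, 99) should survive the cleaning
    assert not valid.isin([77, 88, 99]).any(), f"sentinel codes remain in {col}"
    # All remaining values should sit in the updated 0-30 range
    assert valid.between(0, 30).all(), f"{col} has values outside 0-30"

print("Day-count columns are clean.")
```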

Binary Encoding: Depression Diagnosis¶

The ADDEPEV3 column represents whether a respondent has ever been diagnosed with depression:

  • 1 = Yes → Converted to 1
  • 2 = No → Converted to 0
  • Missing values (NaN) remain unchanged to avoid introducing bias.
In [9]:
# Convert ADDEPEV3 (Depression Diagnosis) into binary format
scoringdf["ADDEPEV3"] = scoringdf["ADDEPEV3"].map({1: 1, 2: 0})

# Ensure the column remains numeric for future calculations
scoringdf["ADDEPEV3"] = scoringdf["ADDEPEV3"].astype("float")

Ordering Ordinal Variables¶

Several survey questions use ordinal scales, but their directions differ: for some, higher codes indicate more distress; for others, less. To preserve the codebook ordering, we convert them into ordered categorical variables (scales are re-oriented later, when the Depression Index is computed):

  • Life Satisfaction (LSATISFY): 1 = Very Satisfied → 4 = Very Dissatisfied
  • Emotional Support (EMTSUPRT): 1 = Always → 5 = Never
  • Social Isolation (SDHISOLT): 1 = Always → 5 = Never
  • Stress Frequency (SDHSTRE1): 1 = Always → 5 = Never
In [10]:
# Define correct ordering for ordinal variables
ordinal_mappings = {
    "LSATISFY": [1, 2, 3, 4],  # 1 = Very Satisfied → 4 = Very Dissatisfied
    "EMTSUPRT": [1, 2, 3, 4, 5],  # 1 = Always → 5 = Never
    "SDHISOLT": [1, 2, 3, 4, 5],  # 1 = Always → 5 = Never
    "SDHSTRE1": [1, 2, 3, 4, 5]   # 1 = Always → 5 = Never
}

# Convert ordinal variables to ordered categorical format
for col, categories in ordinal_mappings.items():
    cat_type = CategoricalDtype(categories=categories, ordered=True)
    scoringdf[col] = scoringdf[col].astype(cat_type)
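A small illustration of what the ordered dtype provides (toy series, not the survey data): order-aware comparisons and min/max become available, which plain integer categories would not guarantee after conversion:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtype mirroring the 1-5 coding used above
scale = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
s = pd.Series([3, 1, 5, 2], dtype=scale)

# Ordered categoricals support order-aware comparisons against a category
print((s > 2).tolist())   # → [True, False, True, False]
print(s.min(), s.max())   # min/max are defined because the dtype is ordered
```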

Handling Missing Values¶

Since survey responses contain missing values, we test different imputation strategies:

  • Median Imputation: Fills missing values with the median.
  • Mean Imputation: Fills missing values with the mean.
  • Zero Imputation: Replaces missing values with 0 (used for comparison).
  • Mode Imputation: Replaces missing values with the most frequent value.
  • No Imputation: Keeps missing values as NaN for reference.
In [11]:
def impute_column(df, col, method):
    """Imputes missing values for a single column based on the given method."""
    if method == 'median':
        return df[col].fillna(df[col].median())
    elif method == 'mean':
        return df[col].fillna(df[col].mean())
    elif method == 'zero':
        return df[col].fillna(0)
    elif method == 'mode':
        return df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else np.nan)
    return df[col]  # If method is 'none', return unchanged
In [12]:
def impute_missing_values(df, method='median'):
    """
    Imputes missing values in the dataset while preserving variable structure.

    Parameters:
    df (DataFrame): The dataset containing missing values.
    method (str): The imputation method ('median', 'mean', 'zero', 'mode', or 'none').

    Returns:
    DataFrame: The dataset with imputed values.
    """
    df = df.copy()  # Avoid modifying the original DataFrame

    numeric_vars = ["MENTHLTH", "POORHLTH"]
    ordinal_vars = ["LSATISFY", "EMTSUPRT", "SDHISOLT", "SDHSTRE1"]

    # Convert ordinal variables to float
    df[ordinal_vars] = df[ordinal_vars].astype("float")

    # Apply imputation method to numeric and ordinal variables
    for col in numeric_vars + ordinal_vars:
        df[col] = impute_column(df, col, method)

    return df

# Apply median imputation
scoringdf_imputed = impute_missing_values(scoringdf.copy(), method="median")

Computing the Depression Index¶

The Depression Index (DI) is a composite score designed to capture self-reported mental health distress.

Variables Included in DI Calculation¶

  • MENTHLTH – Days of poor mental health
  • POORHLTH – Days poor physical or mental health limited usual activities
  • LSATISFY – Life satisfaction (ordinal scale)
  • EMTSUPRT – Emotional support availability
  • SDHISOLT – Social isolation frequency
  • SDHSTRE1 – Frequency of stress
  • ADDEPEV3 – Depression Diagnosis (binary variable, higher weight)

This index is calculated using weighted sums and normalized values.
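In symbols, with $\tilde{x}_i$ the min-max-scaled value of variable $i$, $w_i$ its weight, and ADDEPEV3 entering unscaled as a 0/1 indicator:

$$\mathrm{DI}_{\text{raw}} = \sum_i w_i \,\tilde{x}_i, \qquad \mathrm{DI} = \frac{\mathrm{DI}_{\text{raw}} - \min \mathrm{DI}_{\text{raw}}}{\max \mathrm{DI}_{\text{raw}} - \min \mathrm{DI}_{\text{raw}}}$$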

In [13]:
from sklearn.preprocessing import MinMaxScaler

def compute_DI(df):
    """
    Computes the Depression Index (DI) while incorporating ADDEPEV3 as a binary risk factor.

    Parameters:
    df (DataFrame): Processed dataset with imputed values.

    Returns:
    DataFrame: Dataset with computed DI.
    """
    df = df.copy()  # Prevent modifying original DataFrame

    # Define weights for DI calculation (higher weights indicate stronger contributions)
    weights = {
        'MENTHLTH': 8,  # Days of poor mental health
        'POORHLTH': 5,  # Days poor physical or mental health limited activities
        'LSATISFY': 8,  # Life satisfaction (ordinal)
        'EMTSUPRT': 3,  # Emotional support (ordinal)
        'SDHISOLT': 7,  # Social isolation (ordinal)
        'SDHSTRE1': 3,  # Stress frequency (ordinal)
        'ADDEPEV3': 10  # Binary depression diagnosis (higher weight)
    }

    # Reverse scales coded 1 = Always → 5 = Never so that higher values indicate more distress.
    # LSATISFY already runs from 1 = Very Satisfied to 4 = Very Dissatisfied, so it is left as-is.
    reverse_mappings = {
        'SDHISOLT': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
        'SDHSTRE1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
    }
    df.replace(reverse_mappings, inplace=True)

    # Ensure all required columns exist before proceeding
    missing_cols = [col for col in weights.keys() if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns for DI computation: {missing_cols}")

    # Normalize only non-binary variables
    scaler = MinMaxScaler()
    normalize_cols = [col for col in weights.keys() if col != 'ADDEPEV3']
    df[normalize_cols] = scaler.fit_transform(df[normalize_cols])

    # Compute DI as a weighted sum of scaled variables
    df['DI'] = sum(df[col] * weight for col, weight in weights.items() if col in df.columns)

    # Normalize DI to ensure it falls between 0 and 1
    df['DI'] = (df['DI'] - df['DI'].min()) / (df['DI'].max() - df['DI'].min())

    return df

# Compute the Depression Index on the already-imputed dataset
scoringdf_imputed = compute_DI(scoringdf_imputed)
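The scale-then-weight-then-rescale pattern inside `compute_DI` can be seen on toy data. This is a sketch with two variables and invented weights, not the notebook's full index:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Two variables and illustrative weights, standing in for the full variable set
toy = pd.DataFrame({"MENTHLTH": [0, 10, 30], "SDHISOLT": [1, 3, 5]})
weights = {"MENTHLTH": 8, "SDHISOLT": 7}

# Scale each variable to [0, 1], then form the weighted sum
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
di_raw = sum(scaled[col] * w for col, w in weights.items())

# Rescale the raw index so it spans [0, 1]
di = (di_raw - di_raw.min()) / (di_raw.max() - di_raw.min())
print(di.round(3).tolist())  # lowest-distress row maps to 0, highest to 1
```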

Filling Missing DI Values: Neighboring State & National Mean Approaches¶

Objective¶

This section addresses missing values in the Depression Index (DI) by:

  • Identifying states with missing DI values to assess the extent of the issue.
  • Applying an imputation strategy using neighboring states' DI averages.
  • Applying a fallback mechanism to replace any remaining missing values with the national mean.
  • Ensuring every state has a valid DI score, allowing for a complete dataset in the analysis.

Approach Summary¶

The missing DI values are addressed using a two-step imputation strategy:

  1. Neighboring State Imputation – If a state has missing DI values, we estimate its DI using the average DI of its geographically closest neighboring states. This approach helps maintain regional consistency in the data.
  2. National Mean Fallback – If a state lacks sufficient neighboring data, the overall national mean DI is used as a backup to ensure every state has a complete DI value.

These approaches help maintain the integrity of the dataset while minimizing biases introduced by missing values.

In [14]:
# Identify states with missing DI values
missing_state_di = scoringdf_imputed[scoringdf_imputed['DI'].isna()]['_STATE'].value_counts().reset_index()
missing_state_di.columns = ['State', 'Missing DI Count']

# Sort by count (descending) for better readability
missing_state_di = missing_state_di.sort_values(by="Missing DI Count", ascending=False)
In [15]:
# Get top and bottom 5 states with missing DI values
top_missing = missing_state_di.head(5)
bottom_missing = missing_state_di.tail(5)

print("\nTop 5 States with Highest Missing DI Values:\n", top_missing)
print("\nBottom 5 States with Lowest Missing DI Values:\n", bottom_missing)
Top 5 States with Highest Missing DI Values:
   State  Missing DI Count
0    WA               160
1    OH               117
2    NY               117
3    TX               108
4    MN               100

Bottom 5 States with Lowest Missing DI Values:
    State  Missing DI Count
46    NM                18
47    WY                18
48    DE                17
49    NV                13
50    PR                 9

Imputation Strategy: Neighboring Mean & National Mean¶

To handle missing DI values, a two-step approach is applied:

  1. Neighboring State Imputation: If a state is missing a DI value, the average DI of its geographically closest neighboring states is used.
  2. National Mean Fallback: If a state lacks sufficient neighboring data, the overall national DI mean is used.

This method ensures:

  • More accurate regional representation, since neighboring states often share socioeconomic and health patterns.
  • A complete dataset that allows for more reliable comparisons across states.
In [16]:
# Define neighboring states using 2-letter state codes
state_neighbors = {
    "NY": ["NJ", "CT", "PA"], "OH": ["PA", "MI", "IN"], "TX": ["OK", "LA", "NM", "AR"],
    "CA": ["NV", "OR", "AZ"], "FL": ["GA", "AL"], "IL": ["IN", "IA", "WI"],
    "NC": ["SC", "VA", "TN"], "GA": ["AL", "SC", "TN"], "WA": ["OR", "ID"],
    "CO": ["UT", "NE", "KS"], "VA": ["NC", "WV", "MD"], "LA": ["TX", "MS", "AR"],
    "PA": ["OH", "NY", "NJ"], "SC": ["GA", "NC"], "OK": ["TX", "KS", "AR"],
    "MI": ["OH", "IN", "WI"], "MO": ["KS", "IL", "AR"], "TN": ["KY", "NC", "AL"],
    "WI": ["IL", "MN", "MI"], "KY": ["TN", "IN", "OH"], "MN": ["ND", "WI", "IA"],
    "MD": ["DE", "VA", "PA"], "NJ": ["NY", "PA", "DE"], "NE": ["KS", "CO", "IA"],
    "SD": ["ND", "NE", "MN"], "ND": ["SD", "MN", "MT"], "OR": ["WA", "CA", "ID"],
    "MT": ["ID", "ND", "WY"], "ID": ["MT", "OR", "WA"], "WV": ["VA", "KY", "OH"],
    "AR": ["MO", "TN", "LA"], "NV": ["CA", "UT", "AZ"], "AL": ["GA", "MS", "TN"],
    "MS": ["AL", "LA", "AR"], "ME": ["NH"], "NH": ["ME", "VT"], "VT": ["NH", "NY"],
    "RI": ["MA", "CT"], "DE": ["MD", "NJ"], "HI": ["CA"], "AK": ["WA"], "PR": ["FL"]
}
In [17]:
def fill_missing_DI(df, method="neighboring_mean"):
    """
    Imputes missing Depression Index (DI) values using:
    - **Neighboring State Mean** (default): Uses the average DI of closest states.
    - **National Mean Fallback**: Uses the national DI mean if no neighbors exist.

    Parameters:
    df (DataFrame): Dataset with DI values.
    method (str): 'neighboring_mean' (default) or 'national_mean'.

    Returns:
    DataFrame: Updated dataset with missing DI values imputed.
    """
    df = df.copy()
    di_by_state = df.groupby('_STATE')['DI'].mean()

    # Apply neighboring state mean imputation
    if method == "neighboring_mean":
        missing_states = di_by_state[di_by_state.isna()].index
        for state in missing_states:
            neighbors = state_neighbors.get(state, [])
            valid_neighbors = [neighbor for neighbor in neighbors if neighbor in di_by_state.dropna().index]

            if valid_neighbors:  # Use neighboring states' DI mean
                df.loc[df['_STATE'] == state, 'DI'] = di_by_state.loc[valid_neighbors].mean()

    # Fallback: Fill remaining missing DI values with national mean
    if df['DI'].isna().sum() > 0:
        df['DI'] = df['DI'].fillna(df['DI'].mean())

    return df
In [18]:
# Apply the missing DI imputation function
scoringdf_imputed = fill_missing_DI(scoringdf_imputed, method="neighboring_mean")
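The neighbor-then-national logic can be exercised on toy data. The states and adjacency below are invented for illustration; the real run uses `state_neighbors` above:

```python
import numpy as np
import pandas as pd

# Toy data: three hypothetical states, one with no DI observations at all
df = pd.DataFrame({"_STATE": ["AA", "AA", "BB", "CC"],
                   "DI": [0.1, 0.3, 0.4, np.nan]})
neighbors = {"CC": ["AA", "BB"]}  # invented adjacency

state_mean = df.groupby("_STATE")["DI"].mean()  # AA=0.2, BB=0.4, CC=NaN
for state in state_mean[state_mean.isna()].index:
    valid = [n for n in neighbors.get(state, []) if n in state_mean.dropna().index]
    if valid:  # step 1: fill with the neighboring states' mean
        df.loc[df["_STATE"] == state, "DI"] = state_mean.loc[valid].mean()

df["DI"] = df["DI"].fillna(df["DI"].mean())  # step 2: national-mean fallback
print(df["DI"].tolist())  # CC receives the AA/BB average (0.3)
```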

Verifying Imputation Results¶

The final verification ensures:

  • All missing DI values have been filled.
  • No state is left without a valid DI value.
In [19]:
# Verify imputation success by checking remaining missing DI values
final_missing_count = scoringdf_imputed["DI"].isna().sum()

# Confirm all missing values have been handled
print(f"\nFinal Missing DI Values After Imputation: {final_missing_count}")
if final_missing_count == 0:
    print("All states now have valid DI values.")
else:
    print("Some states still have missing DI values.")

# Display final state-wise DI values
#di_by_state = scoringdf_imputed.groupby('_STATE')['DI'].mean().sort_values()
#print(di_by_state)
Final Missing DI Values After Imputation: 0
All states now have valid DI values.

Visualizing the Depression Index¶

Objective¶

Missing data can significantly influence analytical outcomes. This section explores the impact of different imputation strategies on the Depression Index (DI) by visualizing state-wise differences using a choropleth map.

The four imputation strategies evaluated:

  • Mean Imputation: Missing values are replaced with the mean of the column.
  • Median Imputation: Missing values are replaced with the median (less sensitive to outliers).
  • Zero Imputation: Missing values are treated as zero.
  • No Imputation: Missing values are left as-is for comparison.

Methodology¶

  1. State Mapping: Ensure all state codes remain in two-letter format (e.g., CA, TX, NY) to align with mapping libraries.
  2. Color Scale:
    • High Depression Index (More Distressed) = Dark Blue
    • Low Depression Index (Less Distressed) = Yellow
  3. Visualization: Create a choropleth map for each imputation method to compare regional patterns.

These visualizations provide insight into how different imputation techniques influence the geographical distribution of self-reported depression across the U.S.

In [20]:
import plotly.express as px

def generate_choropleth(df, title):
    """
    Generate a choropleth map for the Depression Index (DI) across U.S. states.

    Parameters:
    df (DataFrame): DataFrame containing DI values per state.
    title (str): Title for the choropleth map.
    """
    # Compute the average DI per state
    state_di = df.groupby('_STATE', as_index=False)['DI'].mean()

    # Ensure that state abbreviations are in uppercase to match geojson
    state_di['_STATE'] = state_di['_STATE'].str.upper()

    fig = px.choropleth(
        state_di,
        locations='_STATE',
        locationmode="USA-states",  # Ensures mapping using 2-letter abbreviations
        color='DI',
        color_continuous_scale=["yellow", "blue"],  # Adjusted to match requested color scheme
        title=title,
        scope="usa",
        labels={'DI': 'Depression Index'}
    )

    # Adjust color scale dynamically to ensure consistency across maps
    fig.update_layout(
        coloraxis=dict(
            cmin=state_di['DI'].min(),
            cmax=state_di['DI'].max()
        ),
        coloraxis_colorbar=dict(
            title="Depression Index",
            tickvals=[state_di['DI'].min(), state_di['DI'].mean(), state_di['DI'].max()],
            ticktext=[f"Low ({state_di['DI'].min():.2f})",
                      f"Medium ({state_di['DI'].mean():.2f})",
                      f"High ({state_di['DI'].max():.2f})"],
            len=0.75,  # Taller color bar
            thickness=12,  # Make the color bar thinner
            y=0.5,  # Center the color bar
        ),
        width=900,  # Set a fixed width to prevent excessive stretching
        height=500,  # Adjust height to keep the map proportional
        margin=dict(l=50, r=50, t=50, b=50)  # Adjust margins for centering
    )

    fig.show()
In [21]:
# Dictionary to store DI values based on different imputation methods
imputed_di_results = {}

# Compute DI for each imputation method and store results
for method in ["mean", "median", "zero", "no"]:
    imputed_df = impute_missing_values(scoringdf.copy(), method=method if method != "no" else "none")
    imputed_df = compute_DI(imputed_df)
    imputed_di_results[method] = imputed_df

# Generate choropleth maps for each imputation method
for method, df in imputed_di_results.items():
    generate_choropleth(df.assign(_STATE=scoringdf['_STATE']),
                        f"Depression Index Choropleth Map ({'No' if method == 'no' else method.capitalize()} Imputation)")

Findings & Next Steps¶

Key Observations¶

  1. Regional Patterns in the Depression Index (DI)

    • Higher DI values remain concentrated in parts of the South and Midwest, particularly in West Virginia, Kentucky, and Tennessee across all imputation strategies.
    • Some Midwestern states (e.g., Nebraska and the Dakotas) no longer show consistently lower DI values across imputation methods.
    • Hawaii appears to have lower DI values in the imputed datasets but is missing from the No Imputation dataset, meaning its exact DI without imputation is unknown.
  2. Impact of Imputation on DI Calculations

    • Mean and median imputation yield similar results, though median imputation appears less influenced by extreme values.
    • Zero imputation results in much lower DI scores overall, potentially underestimating depression indicators in states with missing data.
    • No imputation leaves gaps in state-level analysis, causing inconsistencies in regional trends.
    • Mapping 88 → 0 in MENTHLTH and POORHLTH reflects no days of poor mental/physical health, improving interpretability.
    • Updating the valid range to 0-30 allows a more accurate representation of the reported number of poor health days.
  3. Limitations & Considerations

    • State-Level Aggregation: State averages mask individual-level variations—future analysis should explore subgroup differences such as age and gender.
    • Self-Reporting Bias: The BRFSS dataset relies on self-reported mental health perceptions, which may introduce biases in the DI calculation.
    • Need for Additional Factors: External influences such as sunlight exposure, seasonal effects, and socioeconomic conditions should be integrated to improve interpretability.
    • Possible Underreporting: Respondents whose "None" answer was coded as 88 may differ from those who would have explicitly reported 0 days. Further validation is needed to confirm that the 88 → 0 recoding matches the survey's intended interpretation.

Next Steps¶

  1. Incorporate Daylight Exposure Data

    • Investigate how variations in sunlight duration correlate with DI across states.
    • Compare states with prolonged daylight hours to those with shorter exposure periods.
  2. Analyze Gender & Age Differences

    • Examine gender-based differences in DI to understand potential disparities.
    • Segment DI data by age groups to identify vulnerable populations.
  3. Expand Contextual Factors

    • Explore external socioeconomic factors, including income levels, healthcare access, and social support networks.
    • If available, analyze county-level variations instead of relying solely on state averages.

Record Dependencies¶

In [22]:
%load_ext watermark
%watermark
%watermark --iversions
Last updated: 2025-02-17T02:33:19.801273+00:00

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.17.2

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 64
Architecture: 64bit

numpy     : 1.24.3
matplotlib: 3.7.1
us        : 3.2.0
pandas    : 2.0.2
plotly    : 5.24.1
sklearn   : 1.2.2